
A [Poorly] Illustrated Guide to Google’s Algorithm

Like all great literature, this post started as a bad joke on Twitter on a Friday night:

If you know me, then this kind of behavior hardly surprises you (and I probably owe you an apology or two). What’s surprising is that Google’s Matt Cutts replied, and fairly seriously:

Matt’s concern that even my painfully stupid joke could be misinterpreted demonstrates just how confused many people are about the algorithm. This tweet actually led to a handful of very productive conversations, including one with Danny Sullivan about the nature of Google’s “Hummingbird” update.

These conversations got me thinking about how much we oversimplify what “the algorithm” really is. This post is a journey in pictures, from the most basic conception of the algorithm to something that I hope reflects the major concepts Google is built on as we head into 2014.

The Google algorithm

There’s really no such thing as “the” algorithm, but that’s how we think about it—as some kind of monolithic block of code that Google occasionally tweaks. In our collective SEO consciousness, it looks something like this:

So, naturally, when Google announces an “update”, all we see are shades of blue. We hear about a major algorithm update every month or two, and yet Google confirmed 665 updates (technically, they used the word “launches”) in 2012—obviously, there’s something more going on here than just changing a few lines of code in some mega-program.

Inputs and outputs

Of course, the algorithm has to do something, so we need inputs and outputs. In the case of search, the most fundamental input is Google’s index of the worldwide web, and the output is search engine result pages (SERPs):

Simple enough, right? Web pages go in, [something happens], search results come out. Well, maybe it’s not quite that simple. Obviously, the algorithm itself is incredibly complicated (and we’ll get to that in a minute), but even the inputs aren’t as straightforward as you might imagine.

First of all, the index is actually spread across roughly a dozen data centers distributed around the world, and each data center is a miniature city unto itself, linked by one of the most impressive global fiber-optic networks ever built. So, let’s at least add some color and say it looks something more like this:

Each block in that index illustration is a cloud of thousands of machines and an incredible array of hardware, software and people, but if we dive deep into that, this post will never end. It’s important to realize, though, that the index isn’t the only major input into the algorithm. To oversimplify, the system probably looks more like this:

The link graph, local and maps data, the social graph (predominantly Google+) and the Knowledge Graph—essentially, a collection of entity databases—all comprise major inputs that exist beyond Google’s core index of the worldwide web. Again, this is just a conceptualization (I don’t claim to know how each of these is actually structured as physical data), but each of these inputs is a unique and important piece of the search puzzle.

For the purposes of this post, I’m going to leave out personalization, which has its own inputs (like your search history and location). Personalization is undoubtedly important, but it impacts many areas of this illustration and is more of a layer than a single piece of the puzzle.

Relevance, ranking and re-ranking

As SEOs, we’re mostly concerned (i.e. obsessed) with ranking, but we forget that ranking is really only part of the algorithm’s job. I think it’s useful to split the process into two steps: (1) relevance, and (2) ranking. For a page to rank in Google, it first has to make the cut and be included in the list. Let’s draw it something like this:

In other words, first Google has to pick which pages match the search, and then they pick which order those pages are displayed in. Step (1) relies on relevance—a page can have all the links, +1s, and citations in the world, but if it’s not a match to the query, it’s not going to rank. The Wikipedia page for Millard Fillmore is never going to rank for “best iPhone cases,” no matter how much authority Wikipedia has. Once Wikipedia clears the relevance bar, though, that authority kicks in and the page will often rank well.
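To make the two-step distinction concrete, here’s a toy sketch in Python. To be clear, this is purely illustrative—the scoring functions, data, and field names are all invented for this post, and Google’s real systems look nothing like this—but it captures the key idea: relevance is a gate, and authority only matters after a page passes through it.

```python
# Toy two-stage search: relevance gate first, then authority ranking.
# Entirely conceptual - none of this reflects Google's actual code.

def is_relevant(page, query):
    """Stage 1: the page must match the query at all to be a candidate."""
    return all(term in page["text"].lower() for term in query.lower().split())

def authority_score(page):
    """Stage 2: order the surviving candidates (here, by a made-up link count)."""
    return page["inbound_links"]

def search(index, query):
    candidates = [p for p in index if is_relevant(p, query)]      # relevance
    return sorted(candidates, key=authority_score, reverse=True)  # ranking

index = [
    {"url": "wikipedia.org/Millard_Fillmore",
     "text": "Millard Fillmore president", "inbound_links": 9000},
    {"url": "example.com/cases",
     "text": "best iphone cases reviewed", "inbound_links": 12},
]

print([p["url"] for p in search(index, "best iPhone cases")])
# Wikipedia's massive authority never comes into play - it fails the gate.
```

Note that no amount of `inbound_links` can rescue the Millard Fillmore page here; authority is only consulted among pages that already cleared relevance.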

Interestingly, this is one reason that our large-scale correlation studies show fairly low correlations for on-page factors. Our correlation studies only measure how well a page ranks once it’s passed the relevance threshold. In 2013, it’s likely that on-page factors are still necessary for relevance, but they’re not sufficient for top rankings. In other words, your page has to clearly be about a topic to show up in results, but just being about that topic doesn’t mean that it’s going to rank well.

Even ranking isn’t a single process. I’m going to try to cover an incredibly complicated topic in just a few sentences, a topic that I’ll call “re-ranking.” Essentially, Google determines a core ranking and what we might call a “pure” organic result. Then, secondary ranking algorithms kick in—these include local results, social results, and vertical results (like news and images). These secondary algorithms rewrite or re-rank the original results:
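One way to picture that re-ranking step is as a second pass over the “pure” organic list. Again, this is a conceptual sketch with invented names and data, not how Google actually implements local results:

```python
# Conceptual re-ranking sketch: a secondary (here, "local") algorithm
# rewrites the core organic results. All names and data are invented.

def core_ranking(results):
    """Produce the 'pure' organic ordering."""
    return sorted(results, key=lambda r: r["score"], reverse=True)

def local_rerank(results, user_city):
    """Secondary pass: prefer results matching the searcher's city,
    and inject a local-pack feature at the top of the page."""
    boosted = sorted(results,
                     key=lambda r: (r.get("city") == user_city, r["score"]),
                     reverse=True)
    return [{"url": "[local pack]", "score": None}] + boosted

organic = core_ranking([
    {"url": "a.com", "score": 0.9},
    {"url": "b.com", "score": 0.7, "city": "Chicago"},
])
print([r["url"] for r in local_rerank(organic, "Chicago")])
# The local pass both adds a new feature and reorders the originals.
```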

To see this in action, check out my post on how Google counts local results. Using the methodology in that post, you can clearly see how Google determines a base set of rankings, and then the local algorithm kicks in and not only adds new features but re-ranks the original results. This diagram is only the tip of the iceberg—Bill Slawski has an excellent three-part series on re-ranking that covers 40 different ways Google may re-rank results.

Special inputs: penalties and disavowals

There are also special inputs (for lack of a better term). For example, if Google issues a manual penalty against a site, that has to be flagged somewhere and fed into the system. This may be part of the index, but since this process is managed manually and tied to Google Webmaster Tools, I think it’s useful to view it as a separate concept.

Likewise, Google’s disavow tool is a separate input, in this case one partially controlled by webmasters. This data must be periodically processed and then fed back into the algorithm and/or link graph. Presumably, there’s a semi-automated editorial process involved to verify and clean this user-submitted data. So, that gives us something like this:

Of course, there are many inputs that feed other parts of the system. For example, XML sitemaps in Google Webmaster Tools help shape the index. My goal is to give you a flavor of the major concepts. As you can see, even the “simple” version is quickly getting complicated.

Updates: Panda, Penguin and Hummingbird

Finally, we have the algorithm updates we all know and love. In many cases, an update really is just a change or addition to some small part of Google’s code. In the past couple of years, though, algorithm updates have gotten a bit more tricky.

Let’s start with Panda, originally launched in February of 2011. The Panda update was more than just a tweak to the code—it was (and probably still is) a sub-algorithm with its own data structures, living outside of the core algorithm (conceptually speaking). Every month or so, the Panda algorithm would be re-run, Panda data would be updated, and that data would feed what you might call a Panda ranking factor back into the core algorithm. It’s likely that Penguin operates similarly, in that it’s a sub-algorithm and separate data set. We’ll put them outside of the big, blue oval:
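You can sketch that “sub-algorithm” relationship as a batch job whose output becomes just one more factor in the real-time system. Once more, this is purely conceptual—the function names, the quality heuristic, and the monthly cadence are stand-ins, not anything Google has disclosed:

```python
# Conceptual sketch: a Panda-style sub-algorithm as a periodic batch job
# whose precomputed scores feed a real-time core algorithm as one factor.
# The quality heuristic below is invented for illustration.

panda_scores = {}  # refreshed only when the batch job runs (e.g. monthly)

def assess_quality(page):
    """Hypothetical expensive quality analysis run offline."""
    return 0.5 if page.get("thin_content") else 1.0

def run_panda_batch(index):
    """The periodic, resource-heavy pass over the whole index."""
    for page in index:
        panda_scores[page["url"]] = assess_quality(page)

def core_score(page, query_score):
    """Real-time scoring just looks up the precomputed Panda factor -
    the expensive work already happened in the batch run."""
    return query_score * panda_scores.get(page["url"], 1.0)
```

The point of the split: the cheap lookup happens on every query, while the expensive analysis happens only when the batch is re-run—which is why sites hit by Panda historically had to wait for the next refresh to see recovery.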

I don’t mean to imply that Panda and Penguin are the same—they operate in very different ways. I’m simply suggesting that both of these algorithm updates rely on their own code and data sources and are only periodically fed back into the system.

Why didn’t Google just rewrite the core algorithm to accomplish what Panda and/or Penguin do? Part of it is computational—the resources required to process this data are probably beyond what the real-time infrastructure can handle. As Google gets faster and more powerful, these sub-algorithms may become fully integrated (and Panda is probably more integrated than it once was). The other reason may involve testing and mitigating impact. It’s likely that Google only updates Penguin periodically because of the large impact that the first Penguin update had. This may not be a process that they simply want to let loose in real-time.

So, what about the recent Hummingbird update? There’s still a lot we don’t know, but Google has made it pretty clear that Hummingbird is a fundamental rewrite of how the core algorithm works. I don’t think we’ve seen the full impact of Hummingbird yet, personally, and the potential of this new code may be realized over months or even years, but now we’re talking about the core algorithm(s). That leads us to our final image:

Image credit for hummingbird silhouette: Michele Tobias at Experimental Craft.

The end result surprised even me as I created it. This was the most basic illustration I could make that didn’t feel misleading or simplistic. The reality of Google today far surpasses this diagram—every piece is dozens of smaller pieces. I hope, though, that this gives you a sense for what the algorithm really is and does.

Additional resources

If you’re new to the algorithm and would like to learn more, Google’s own “How Search Works” resource is actually pretty interesting (check out the sub-sections, not just the scroller). I’d also highly recommend Chapter 1 of our Beginner’s Guide: “How Search Engines Operate.” If you just want to know more about how Google operates, Steven Levy’s book “In The Plex” is an amazing read.

Special bonus nonsense!

While writing this post, the team and I kept thinking there must be some way to make it more dynamic, but all of our attempts ended badly. Finally, I just gave up and turned the post into an animated GIF. If you like that sort of thing, then here you go…
